Method and system for automatic sensing, analysis, composition and direction of a 3D space, scene, object, and equipment

ABSTRACT

Method and system for automatic composition and orchestration of a 3D space or scene using networked devices and computer vision to bring ease of use and autonomy to a range of compositions. A scene, its objects, subjects and background are identified and classified, and relationships and behaviors are deduced through analysis. Compositional theories are applied, and context attributes (for example location, external data, camera metadata, and the relative positions of subjects and objects in the scene) are considered automatically to produce optimal composition and allow for direction of networked equipment and devices. Events inform the capture process; for example, a video recording is initiated when a rock climber waves her hand, and an autonomous camera automatically adjusts to keep her body in frame throughout the sequence of moves. Model analysis allows for direction, including audio tones to indicate proper form for the subject and instructions sent to equipment to ensure optimal scene orchestration.

CROSS REFERENCE TO RELATED APPLICATION

The instant application is a utility application of the previously filed U.S. Provisional Application 62/053,055 filed on 19 Sep. 2014. The pending U.S. Provisional Application 62/053,055 is hereby incorporated by reference in its entirety for all of its teachings.

FIELD OF INVENTION

A method and system for automatic sensing using photographic equipment that captures a 3D space, scene, subject, object, and equipment for further analysis, composition, and direction that can be used for creating visual design.

BACKGROUND

Computer device hardware and software continue to advance in sophistication. Cameras, microcontrollers, computer processors (e.g., ARM), and smartphones have become more capable, as well as smaller, cheaper, and ubiquitous. In parallel, more sophisticated algorithms including computer vision, machine learning, and 3D models can be computed in real-time or near real-time on a smartphone or distributed over a plurality of devices over a network.

At the same time, multiple cameras including front-facing cameras on smartphones have enabled the popularity of the selfie as a way for anyone to quickly capture a moment and share it with others. But the primary mechanism for composition has not advanced beyond an extended arm or a selfie stick and use of the device's screen as a visual reference for the user to achieve basic scene framing. Recently, there have been GPS-based drone cameras introduced, such as Lily, that improve on the selfie stick, but they are not autonomous and instead require the user to wear a tracking device to continually establish the focal point of the composition and pass directional “commands” to the drone via buttons on the device. This is limiting when trying to include multiple dynamic subjects and/or objects in the frame (a “groupie”), or when the user is preoccupied or distracted (for example at a concert, or while engaged in other activities).

SUMMARY

The present invention is in the areas of sensing, analytics, direction, and composition of 3D spaces. It provides a dynamic real-time approach to sense, recognize, and analyze objects of interest in a scene; applies a composition model that automatically incorporates best practices from prior art as models, for example: photography, choreography, cinematography, art exhibition, and live sports events; and directs subjects and equipment in the scene to achieve the desired outcome.

In one embodiment, a high-quality professional-style recording is being composed using the method and system. Because traditional and ubiquitous image capture equipment can now be enabled with microcontrollers and/or sensor nodes in a network to synthesize numerous compositional inputs and communicate real-time directions to subjects and equipment using a combination of sensory (e.g., visual, audio, vibration) feedback and control messages, it becomes significantly easier to get a high-quality output on one's own. If there are multiple people or subjects who need to be posed precisely, each subject can receive personalized direction to ensure their optimal positioning relative to the scene around them.

In one embodiment, real-world scenes are captured using sensor data and translated into 2D, 2.5D and 3D models in real-time using a method such that continuous spatial sensing, recognition, composition, and direction are possible without requiring additional human judgment or interaction with the equipment and/or scene.

In one embodiment, image processing, image filtering, video analysis, motion detection, background subtraction, object tracking, pose estimation, stereo correspondence, and 3D reconstruction are run perpetually to provide optimal orchestration of subjects and equipment in the scene without a human operator.

In one embodiment, subjects can be tagged explicitly by a user, or determined automatically by the system. If desired, subjects can be tracked or kept in frame over time and as they move throughout a scene, without further user interaction with the system. The subject(s) can also be automatically directed through sensory feedback (e.g., audio, visual, vibration) or any other user interface.

In one embodiment as a method, an event begins the process of capturing the scene. The event can be an explicit hardware action such as pressing a shutter button or activating a remote control for the camera, or the event can be determined via software, a real-world event, message, or notification symbol; for example, recognizing the subject waving their arms, a hand gesture, or an object, symbol, or identified subject or entity entering a predetermined area in the scene.

The system allows for the identification of multiple sensory event types, including physical-world events (an object entering/exiting the frame, a sunrise, a change in the lighting of the scene, the sound of someone's voice, etc.) and software-defined events (state changes, timers, sensor-based). In one embodiment, a video recording is initiated when a golfer settles into her stance and aims her head down. The camera automatically adjusts to keep her moving club in the frame during her backswing, activates burst mode to best capture the moment of impact with the ball during her downswing, and pauses the recording seconds after the ball leaves the frame. Feedback can be further provided to improve her swing based on rules and constraints provided by an external golf professional, while measuring and scoring how well she complies with leading practice motion ranges.

In another embodiment, a video or camera scan can be voice-initiated or automatically initiated when the subject is inside the camera frame, and the system can monitor and direct the subject through a sequence of easy-to-understand body movements and steps with a combination of voice, lights, and simple mimicry of on-screen poses as in a user interface or visual display. For a few examples, the subject could be practicing and precisely evaluating yoga poses, following a physical therapy program, or taking private body measurements.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A, 1B, 1C, 1D, 1E show various methods for hands-free capture of a scene.

FIG. 2A, 2B, 2C, 2D, 2E, 2F, 2G, 2H, 2I, 2J illustrate a system of cameras that can be used to implement a method of sensing, analyzing, composing, and directing a scene.

FIG. 3 shows examples of interfaces, inputs, and outputs to direct subjects in a scene.

FIG. 4 shows further examples of interfaces when capturing close-up scenes.

FIG. 5 shows examples of selfies, groupies, and other typical applications.

FIG. 6 is a diagram of the Sensing Module, Analytics Module, Composition/Architecture Module, and Direction/Control Module.

FIG. 7 is a diagram of the system's algorithm and high-level process flow.

FIG. 8 is a detailed look at the Sensing Module from FIG. 6.

FIG. 9 is a high-level view of the system architecture for on-premise and cloud embodiments.

FIG. 10 illustrates various iconic and familiar compositions and reference poses.

FIG. 11 shows an interface for choosing a composition model and assigning objects or subjects for direction.

FIG. 12 shows further examples of compositions that can be directed.

FIG. 13 shows an example interface for using data attached to specific geolocations, as well as an example use case.

FIG. 14 shows how computer vision can influence composition model selection.

FIGS. 15A and 15B show examples of Building Information Management (BIM) applications.

FIG. 16 shows how a collection of images and file types can be constructed and deconstructed into sub-components including 3D aggregate models and hashed files for protecting user privacy across the system from device to network and cloud service.

FIG. 17 shows types of inputs that inform the Models from FIG. 6.

FIG. 18 shows a method for virtual instruction to teach how to play music.

FIG. 19 is an example of how a Model can apply to Sensed data.

FIG. 20 shows example connections to the network and to the Processing Unit.

DETAILED DESCRIPTION

The present invention enables real-time sensing, spatial composition, and direction for objects, subjects, scenes, and equipment in 2D, 2.5D or 3D models in a 3D space. In a common embodiment, a smartphone will be used for both its ubiquity and the combination of cameras, sensors, and interface options.

FIG. 1A shows how such a cell phone (110) can be positioned to provide hands-free capture of a scene. This can be achieved using supplemental stands different from traditional tripods designed for non-phone cameras. FIG. 1C shows a stand can be either foldable (101) or rigid (102) so long as it holds the sensors on the phone in a stable position. A braced style of stand (103) like the one shown in FIG. 1E can also be used. The stand can be made of any combination of materials, so long as the stand is sufficiently tall and wide as to support the weight of the capturing device (110) and hold it securely in place.

In an embodiment, the self-assembled stand (101) can be fashioned from materials included as a branded or unbranded removable insert (105) in a magazine or other promotion (106), with labeling and tabs sufficient so that the user is able to remove the insert (105) and assemble it into a stand (101) without any tools. This shortens the time to initial use by an end-user by reducing the steps needed to position a device for proper capture of a scene.

As seen in FIG. 1D, the effect of the stand can also be achieved using the angle of the wall/floor and the natural surface friction of a space. In this embodiment, the angle of placement (107) is determined by the phone's (110) sensors and slippage can be detected by monitoring changes in those sensors. The angle of elevation can be extrapolated from the camera's lens (111), allowing for very wide capture of a scene when the phone is oriented in portrait mode. Combined with a known fixed position from the bottom of the phone to the lens (104), the system is now able to deliver precise measurements and calibrations of objects in a scene. This precision could be used, for example, to capture a user's measurements and position using only one capture device (110) instead of multiple.
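By way of illustration only, a minimal Python sketch of the trigonometry such a placement enables; the lens offset, tilt angle, camera intrinsics, and distance used here are assumed placeholder values, not parameters specified by this disclosure:

    import math

    def lens_height_above_floor(lens_offset_from_bottom_m, tilt_from_vertical_rad):
        # The phone leans with its bottom edge on the floor; the lens sits a known
        # distance (104) along the body, foreshortened by the placement angle (107).
        return lens_offset_from_bottom_m * math.cos(tilt_from_vertical_rad)

    def pixel_elevation_angle(v_pixel, cy, fy):
        # Angle above the optical axis for an image row (image y grows downward).
        return math.atan2(cy - v_pixel, fy)

    def object_height_m(distance_m, lens_height_m, camera_pitch_rad, v_top_pixel, cy, fy):
        # For a leaning phone, the rear camera's pitch above horizontal roughly equals
        # the tilt from vertical; the height of an object's top then follows from
        # simple trigonometry at a known horizontal distance.
        ray = camera_pitch_rad + pixel_elevation_angle(v_top_pixel, cy, fy)
        return lens_height_m + distance_m * math.tan(ray)

    # Illustrative numbers only: 12 cm lens offset, 10-degree lean, subject 2 m away.
    tilt = math.radians(10.0)
    h_lens = lens_height_above_floor(0.12, tilt)
    print(round(object_height_m(2.0, h_lens, tilt, v_top_pixel=300, cy=960, fy=1500), 2))

The same geometry can be inverted to calibrate distances once one reference dimension in the scene is known.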

When positioning the device on a door, wall, or other vertical surface (FIG. 1A), adhesive or magnets (120) can be used to secure the capture device (110) and prevent it from falling. For rented apartments or other temporary spaces, the capture device can also be placed in a case (122) such that the case can then be mounted via hooks, adhesives, magnets, or other ubiquitous fasteners (FIG. 1B). This allows for easy removal of the device (110) without compromising or disturbing the capture location (121).

Referring now to FIG. 2A, 2B, 2C, 2D, 2E, 2F, 2G, 2H, 2I, 2J, various devices can be orchestrated into an ensemble to capture a scene. Existing capture device types can be positioned and networked to provide an optimal awareness of the scene. Examples include: cameras (202), wearable computers such as the Apple Watch, Google Glass, or FitBit (FIG. 2B), pan/tilt cameras such as those found in webcams/security cameras (FIG. 2C), mobile devices such as smartphones or tablets (FIG. 2D) equipped with front and rear-facing cameras (including advanced models with body-tracking sensors and fast-focus systems such as in the LG G3), traditional digital cameras (FIG. 2E), laptops with integrated webcams (FIG. 2F), depth cameras or thermal sensors (FIG. 2G) like those found in the Xbox Kinect hardware, dedicated video cameras (FIG. 2H), and autonomous equipment with only cameras attached (FIG. 2I) or autonomous equipment with sensors (FIG. 2J) such as sonar sensors, or infrared, laser, or thermal imaging technology.

Advances in hardware/software coupling on smartphones further extend the applicability of the system and provide opportunities for a better user experience when capturing a scene, because ubiquitous smartphones and tablets (FIG. 2D) can increasingly be used instead of traditionally expensive video cameras (FIG. 2E, FIG. 2H).

Using the mounts described in FIG. 1A, a device (110) can be mounted on a door or wall to capture the scene. The door allows panning of the scene by incorporating the known fixed-plane movement of the door. For alternate vantage points, it is also possible to use the mounts to position a device on a table (213) or the floor using a stand (103), or to use a traditional style tripod (215). The versatility afforded by the mounts and stands allows for multiple placement options for capturing devices, which in turn allows for greater precision and flexibility when sensing, analyzing, composing, and directing a subject in a 3D space.

Once recognized in a scene, subjects (220) can then be directed via the system to match desired compositional models, according to various sensed orientations and positions. These include body alignment (225), arm placement (230), and head tilt angle (234). Additionally, the subject can be directed to rotate in place (235) or to change their physical location by moving forward, backward, or laterally (240).

Rotation (225) in conjunction with movement along a plane (240) also allows for medical observation, such as orthopedic evaluation of a user's gait or posture. While an established procedure exists today wherein trained professionals evaluate gait, posture, and other attributes in person, access to those professionals is limited and the quality and consistency of the evaluations is irregular. The invention addresses both shortcomings through a method and system that makes use of ubiquitous smartphones (110) and the precision and modularity of models. Another instance where networked sensors and cameras can replace a human professional is precise body measurement, previously achieved by visiting a quality tailor. By creating a 3D scene and directing subjects (220) intuitively as they move within it, the system is able to ensure with high accuracy that the subjects go through the correct sequences and that the appropriate measurements are collected efficiently and with repeatable precision. Additionally, this method of dynamic and precise capture of a subject while sensing can be used to achieve the positioning required for stereographic images with, e.g., a single lens or sensor.

FIG. 3 provides examples of interface possibilities to communicate feedback to the subjects and users. The capturing device (110) can relay feedback that is passed to subjects through audio tones (345), voice commands (346), visually via a screen (347), or using vibration (348). An example of such a feedback loop is shown as a top view looking down on the subject (220) as they move along the rotation path directed in (225) according to audio tones heard by the subject (349).

The visual on-screen feedback (347) can take the form of a superimposed image of the subject's sensed position relative to the directed position in the scene (350). In one embodiment, the positions are represented as avatars, allowing human subjects to naturally mimic and achieve the desired position by aligning the two avatars (350). Real-time visual feedback is possible because the feedback-providing device (110) is networked (351) to all other sensing devices (352), allowing for synthesis and scoring of varied location and position inputs and providing a precise awareness of the scene's spatial composition (this method and system is discussed further in FIG. 8). One example of additional sensing data that can be networked is imagery from an infrared camera (360).
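As an illustration of how networked position inputs could be scored against a directed pose, the following sketch compares sensed and target joint positions; the joint names, coordinates, and tolerance are assumptions for the example, not a scoring algorithm prescribed by this disclosure:

    import numpy as np

    def pose_compliance(sensed, target, tolerance_m=0.05):
        # `sensed` and `target` map joint names to (x, y, z) positions in metres.
        # Returns a 0..1 compliance score plus per-joint errors that could drive the
        # avatar overlay or the audio/vibration direction described above.
        errors = {}
        for joint, goal in target.items():
            if joint in sensed:
                errors[joint] = float(np.linalg.norm(np.subtract(sensed[joint], goal)))
        if not errors:
            return 0.0, errors
        score = sum(e <= tolerance_m for e in errors.values()) / len(errors)
        return score, errors

    # Illustrative values: the left wrist is about 12 cm from where the composition wants it.
    target = {"head": (0.0, 1.70, 0.0), "left_wrist": (0.40, 1.10, 0.10)}
    sensed = {"head": (0.01, 1.69, 0.0), "left_wrist": (0.47, 1.00, 0.10)}
    print(pose_compliance(sensed, target))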

Other devices such as Wi-Fi-enabled GoPro®-style action cameras (202) and wearable technologies such as a smart watch with a digital display screen (353) can participate in the network (351) and provide the same types of visual feedback (350). This method of networking devices for capturing and directing allows individuals to receive communications according to their preferences on any network-connected device such as, but not limited to, a desktop computer (354), laptop computer (355), phone (356), tablet (357), or other mobile computer (358).

FIG. 4 provides examples of an interface when the screen is not visible, for example because the capture device is too close to the subject. If the capture device is a smartphone (110) oriented to properly capture a subject's foot (465), it is unlikely that the subject will have sufficient ability to interact with the phone's screen, and there may not be additional devices or screens available to display visual feedback to the user.

The example in (466) shows how even the bottom of a foot (471) can be captured and precise measurements can be taken using a smartphone (110). By using the phone's gyroscope, the phone's camera can be directed to begin the capture when the phone is on its back, level, and the foot is completely in frame. No visual feedback is required and the system communicates direction such as rotation (470) or orientation changes (473, 474) through spoken instructions (446) via the smartphone's speakers (472).
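A minimal sketch of such a "level and in frame" trigger follows; it assumes a raw accelerometer reading and a hypothetical foot-detection bounding box, and the tolerances are illustrative rather than values defined by the disclosure:

    import numpy as np

    def is_flat_on_back(accel_xyz, tolerance_deg=3.0):
        # True when the gravity vector is within `tolerance_deg` of the phone's +z
        # axis, i.e. the phone is lying screen-up and level.
        a = np.asarray(accel_xyz, dtype=float)
        tilt = np.degrees(np.arccos(np.clip(a[2] / np.linalg.norm(a), -1.0, 1.0)))
        return tilt <= tolerance_deg

    def subject_fully_in_frame(bbox, frame_w, frame_h, margin_px=10):
        # True when a detected bounding box (x, y, w, h) sits entirely inside the
        # frame with a small safety margin.
        x, y, w, h = bbox
        return (x >= margin_px and y >= margin_px and
                x + w <= frame_w - margin_px and y + h <= frame_h - margin_px)

    def should_start_capture(accel_xyz, foot_bbox, frame_w, frame_h):
        return is_flat_on_back(accel_xyz) and subject_fully_in_frame(foot_bbox, frame_w, frame_h)

    # Example: nearly level phone, detected foot box well inside a 1080x1920 frame.
    print(should_start_capture((0.1, -0.2, 9.79), (220, 380, 400, 900), 1080, 1920))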

Multiple sensory interface options provide ways to make the system more accessible, and allow more people to use it. In an embodiment, a user can indicate they do not want to receive visual feedback (because they are visually impaired, or because the ambient lighting is too bright, or for other reasons) and their preference can be remembered, so that they can receive feedback through audio (446) and vibration (448) instead.

Referring now to FIG. 5, examples of different types of scenes are shown to indicate how various compositional models can be applied. Traditionally, sensing, analytics, composition, and direction have been manual processes. The selfie shown in (501) is a photo or video that is typically difficult for the operator to capture at arm's length and/or is reliant on a front-facing camera so that immediate visual feedback is provided. Absent extensive planning and rehearsal, an additional human photographer has previously been required to achieve well-composed scene capture as seen in (502) and (503). Compositions with small children (504) or groups (505) represent further examples of use cases that are traditionally difficult to achieve without a human camera operator, because of the number of subjects involved and the requirement that they be simultaneously directed into precise poses.

Additionally, sports-specific movements such as those in soccer (506) (goalkeeper positioning, shooting on goal, or dribbling and juggling form) and activities like baseball (507) (batting, fielding, catching), martial arts (508), dance (509), or yoga (510) are traditionally difficult to self-capture, as they require precise timing and the subject is preoccupied, so visual feedback becomes impractical. Looking again at (506), the ball may only contact the athlete's foot for a short amount of time, so the window for capture is correspondingly brief. The existing state of the art for capturing such images is to record high-definition, high-speed video over the duration of the activity and generate stills afterward, often manually. This is inefficient and creates an additional burden to sift through potentially large amounts of undesired footage.

A method and system for integrating perpetual sensor inputs, real-time analytics capabilities, and layered compositional algorithms (discussed further in FIG. 6) provides a benefit to the user in the form of automatic direction and orchestration without the need for additional human operators. In one embodiment, sports teams' uniforms can contain a designated symbol for sensing specific individuals, or existing uniform numbers can be used with CV and analytics methods to identify participants using software. Once identified, the system can use these markers for both identification and editing to inform capture, as well as for direction and control of the subjects.

In another embodiment, the system can use the order of the images to infer a motion path and can direct participants in the scene according to a compositional model matched from a database. Or, the images provided can be input to the system as designated “capture points” (516) or moments to be marked if they occur in the scene organically. This type of system for autonomous capture is valuable because it simplifies the post-capture editing/highlighting process by reducing the amount of waste footage captured initially, as defined by the user.

In another embodiment, static scenes such as architectural photography (518) can also be translated from 2D to 3D. The method for recording models for interior (517) and exterior (518) landscapes by directing the human user holding the camera can standardize historically individually composed applications (for example in real estate appraisals, MLS listings, or promotional materials for hotels). Because the system is capable of self-direction and provides a method for repeatable, autonomous capture of high-quality visual assets by sensing, analyzing, composing, and directing, the system allows professionals in the above-mentioned verticals to focus their efforts not on orchestrating the perfect shot but on storytelling.

In another embodiment, mounted cameras and sensors can provide information for Building Information Modeling (BIM) systems. Providing real-time monitoring and sensing allows events to be not only tagged but also directed and responded to, using models that provide more granularity than is traditionally available. In one embodiment, successful architectural components from existing structures can evolve into models that can inform new construction, direct building maintenance, identify how people are using the building (e.g., traffic maps), and optimize HVAC or lighting, or adjust other environment settings.

As their ubiquity drives their cost down, cameras and sensors used for creating 3D building models will proliferate. Once a 3D model of a building has been captured (517), the precise measurements can be shared and made useful to other networked devices. As an example, the state of the art now is for each device to create its own silos of information. Dyson's vacuum cleaner The Eye, for example, captures multiple 360-degree images each second on its way to mapping a plausible vacuuming route through a building's interior, but those images remain isolated and aren't synthesized into a richer understanding of the physical space. Following 3D space and markers using relative navigation of model parameters and attribute values is much more reliable and less costly, regardless of whether image sensing is involved.

In another embodiment, the system can pre-direct a 3D scene via a series of 2D images such as a traditional storyboard (515). This can be accomplished by sensing the content in the 2D images, transforming the sensed 2D content into a 3D model of the scene, objects, and subjects, and ultimately assigning actors roles based on the subjects and objects they are to mimic. This transformation method allows for greater collaboration in the film and television industries by enabling the possibility of productions where direction can be given to actors without the need for actors and directors to be in the same place at the same time, or to speak a common language.

FIG. 6 shows the method of the invention, including the identification of foundational components including Objects, Subjects, Scene, Scape, and Equipment (601).

Once the capture process has been started (602), pre-sensed contexts and components (Object(s), Subject(s), Scene, Scape, Equipment) (601) are fed into the Sensing Module (603). Both physical and virtual analytics such as computer vision (i.e., CV) can then be applied in the Analytics Module (604) to make sense of the scene components identified in the Sensing Module (603), and those components can be mapped against composition models in the Composition/Architecture Module (605) so that, in an embodiment, a subject can be scored for compliance against a known composition or pattern. Pre-existing models can be stored in a Database (600) that can hold application states and reference models, and those models can be applied at every step of this process. Once the analysis has taken place comparing sensed scenes to composed scenes, direction of the components of the scene can occur in the Direction/Control Module (606), up to and including control of robotic or computerized equipment. Other types of direction include touch-UI, voice-UI, display, control message events, sounds, vibrations, and notifications. Equipment can be similarly directed via the Direction/Control Module (606) to automatically and autonomously identify a particular subject (e.g., a baseball player) in conjunction with other pattern recognition (such as a hit, 507), allowing efficient capture of subsets in frame only. This can provide an intuitive way for a user to instruct the capture of a scene (e.g., begin recording when #22 steps up to the plate, and save all photos of his swing, if applicable).
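As a structural sketch only (the function and field names are hypothetical; the disclosure does not prescribe an API), the four modules and the Database (600) could be wired into a feedback loop along these lines:

    from dataclasses import dataclass, field

    @dataclass
    class SceneState:
        # Running state passed between the modules of FIG. 6.
        components: dict = field(default_factory=dict)   # from the Sensing Module (603)
        analysis: dict = field(default_factory=dict)     # from the Analytics Module (604)
        scores: dict = field(default_factory=dict)       # from the Composition/Architecture Module (605)
        directions: list = field(default_factory=list)   # from the Direction/Control Module (606)

    class Pipeline:
        """Wires sense -> analyze -> compose -> direct around a shared model store."""
        def __init__(self, sense, analyze, compose, direct, database):
            self.sense, self.analyze = sense, analyze
            self.compose, self.direct = compose, direct
            self.database = database  # stands in for the Database (600)

        def step(self, raw_inputs, state=None):
            state = state or SceneState()
            state.components = self.sense(raw_inputs, self.database)
            state.analysis = self.analyze(state.components, self.database)
            state.scores = self.compose(state.analysis, self.database)
            state.directions = self.direct(state.scores, self.database)
            self.database.setdefault("history", []).append(state.scores)  # feedback loop
            return state

    # Trivial stand-ins show the data flow; real modules would run CV/ML and messaging.
    pipe = Pipeline(
        sense=lambda raw, db: {"subjects": raw},
        analyze=lambda comp, db: {"count": len(comp["subjects"])},
        compose=lambda ana, db: {"compliance": 1.0 if ana["count"] == 2 else 0.5},
        direct=lambda sc, db: ["hold position"] if sc["compliance"] == 1.0 else ["step closer"],
        database={},
    )
    print(pipe.step(["alice", "bob"]).directions)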

The Sensing Module (603) can connect to the Analytics Module (604) and the Database (600); the Composition/Architecture Module (605) and Direction/Control Module (606) can likewise connect to the Analytics Module (604) and the Database (600), as shown in FIG. 6.

In another embodiment, the capability gained from pairing the system's Sensing Module (603) and Analytics Module (604) with its Composition/Architecture Module (605) and Direction/Control Module (606) allows for on-demand orchestration of potentially large numbers of people in a building, for example automatically directing occupants to safety during an emergency evacuation such as a fire. The Sensing Module (603) can make sense of inputs from sources including security cameras, proximity sensors such as those found in commercial lighting systems, and models stored in a database (600) (e.g., seating charts, blueprints, maintenance schematics) to create a 3D model of the scene and its subjects and objects. Next, the Analytics Module (604) can use layered CV algorithms such as background cancellation to deduce, for example, where motion is occurring. The Analytics Module (604) can also run facial and body recognition processes to identify human subjects in the scene, and can make use of ID badge reading hardware inputs to link sensed subjects to real-world identities. The Composition/Architecture Module (605) can provide the optimal choreography model for the evacuation, which can be captured organically during a previous fire drill at this location, or can be provided to the system in the form of an existing “best practice” for evacuation. All three modules (Sensing Module (603), Analytics Module (604), and Composition/Architecture Module (605)) can work in a feedback loop to process sensed inputs, make sense of them, and score them against the ideal compositional model for the evacuation. Additionally, the Direction/Control Module (606) can provide feedback to the evacuees using the methods and system described in FIG. 3 and FIG. 4. The Direction/Control Module (606) can also, for example, shut off the gas line to the building if it has been properly networked beforehand. Because the Sensing Module (603) is running continuously, the system is capable of sensing if occupants are not complying with the directions being given from the Direction/Control Module (606). The benefits of automatically synthesizing disparate inputs into one cohesive scene are also evident in this example of an emergency evacuation, as infrared camera inputs allow the system to identify human subjects using a combination of CV algorithms and to direct them to the correct evacuation points, even if the smoke is too thick for traditional security cameras to be effective, or the evacuation points are not visible. The Direction/Control Module (606) can also dynamically switch between different styles of feedback; for example, if high ambient noise levels are detected during the evacuation, feedback can be switched from audio to visual or haptic.

FIG. 7 is a process flow for a process, method, and system for automatic orchestration, sensing, composition and direction of subjects, objects and equipment in a 3D space. Once started (700), any real-world event (701), from a user pushing a button on the software UI to some specific event or message received by the application, can begin the capture process and the Sensing Module (603). This sensing can be done by a single sensor, for example an infrared or sonic sensor device (702), or from a plurality of nodes in a network that could also include a combination of image sensing (or camera) nodes (703).

To protect subject privacy and provide high levels of trust in the system, traditional images are neither captured nor stored, and only obfuscated point clouds are recorded by the device (704). These obfuscated point clouds are less identifiable than traditional camera-captured images, and can be encrypted (704). In real-time, as this data is captured at any number of nodes and types, either by a set of local devices (e.g., smartphones) or by a cloud-based service, a dynamic set of computer vision modules (i.e., CV) (705) and machine learning algorithms (ML) are included and reordered as they are applied to optimally identify the objects and subjects in a 3D or 2D space. A “context system” external to the invention (706) can concurrently provide additional efficiency or speed in correlating what's being sensed with prior composition and/or direction models. Depending on the results from the CV and on the specific use-case, the system can transform the space, subjects and objects into a 3D space with 2D, 2.5D or 3D object and subject models (707).
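One possible realization of such obfuscation and fingerprinting is sketched below; the voxel size, jitter, and choice of SHA-256 are assumptions for the example, not values specified by this disclosure:

    import hashlib
    import numpy as np

    def obfuscate_point_cloud(points, voxel_size=0.05, jitter_std=0.01, seed=None):
        # Reduce a raw (N, 3) point cloud to a coarser, less identifiable form:
        # quantize to a voxel grid, keep one point per voxel, and add small jitter.
        rng = np.random.default_rng(seed)
        pts = np.asarray(points, dtype=float)
        voxels = np.unique(np.floor(pts / voxel_size).astype(int), axis=0)
        coarse = (voxels + 0.5) * voxel_size
        return coarse + rng.normal(0.0, jitter_std, coarse.shape)

    def fingerprint(points):
        # Stable hash of the obfuscated cloud, usable as an identifier across the
        # network without transmitting the underlying geometry.
        rounded = np.round(np.asarray(points, dtype=float), 3)
        return hashlib.sha256(rounded.tobytes()).hexdigest()

    # Example: a dense random cloud collapses to far fewer, slightly jittered points.
    raw = np.random.default_rng(0).uniform(0, 1, (5000, 3))
    cloud = obfuscate_point_cloud(raw, seed=0)
    print(len(cloud), fingerprint(cloud)[:16])

Standard encryption of the resulting bytes could then be layered on before storage or transmission.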

In some use-cases, additional machine learning and heuristic algorithms (708) can be applied across the entire system and throughout processes and methods, for example to correlate the new space being sensed with the most relevant composition and/or direction models, or to provide other applications outside of this application with analytics on this new data. The system utilizes both supervised and unsupervised machine learning in parallel, which can run in the background to provide context (706) around, for example, what CV and ML methods were implemented most successfully. Supervised and unsupervised machine learning can also identify the leading practices associated with successful outcomes, where success can be determined by criteria from the user, or expert or social feedback, or publicly available success metrics. For performance, the application can cache in memory the most relevant composition model(s) (710) for faster association with models related to sensing and direction. While monitoring and tracking the new stored sensed data (711), this data can be converted and dynamically updated (712) into a new unique composition model if the pattern is unique, for example as determined automatically using statistical analysis, ML, or manually through a user/expert review interface.

In embodiments where a user is involved in the process, the application can provide continual audio, speech, vibration or visual direction to a user (715), or periodically send an event or message to an application on the same or another device on the network (716) (e.g., a second camera to begin capturing data). Direction can be sporadic or continuous, can be specific to humans or equipment, and can be given using the approaches and interfaces detailed in FIG. 3.

As the application monitors the processing of the data, it utilizes a feedback loop (720) against the composition or direction model and will adjust parameters and loop back to (710), or adjust the inclusion of software components, and update dynamically on a continuous basis (721). New composition models will be stored (722), whether detected by the software or defined by a user or expert through a user interface (723). New and old composition models and corresponding data are managed and version controlled (724).

By analyzing the output from the Sensing Module (603), the system can dynamically and automatically utilize or recommend a relevant stored composition model (725) and direct users or any and all equipment or devices from this model. In other use cases, the user can manually select a composition model from those previously stored (726).

From the composition model, the direction model (727) provides events, messages, and notifications, or control values to other subjects, applications, robots or hardware devices. Users and/or experts can provide additional feedback as to the effectiveness of a direction model (728), to validate, augment or improve existing direction models. These models and data are version controlled (729).

In many embodiments, throughout the process the system can sporadically or continuously provide direction (730), by visual UI, audio, voice, or vibration to user(s), or control values by event or message to networked devices (731) (e.g., robotic camera dolly, quadcopter drone, pan and tilt robot, Wi-Fi-enabled GoPro®, etc.).

Each process throughout the system can utilize a continuous feedback loop as it monitors, tracks, and reviews sensor data against training set models (732). The process can continuously compute and loop back to (710) in the process flow, and can end (733) on an event or message from an external or internal application, or on input from a user/expert through a UI.

FIG. 8 is a process flow for the Sensing Module (603) of the system, which can be started (800) by a user through UI or voice command, by sensing a pattern in the frame (801), or by an event in the application. A plurality of sensors capture data into memory (802), and through a combination of machine learning and computer vision sensing and recognition processing, entities, objects, subjects and scenes can be recognized (803). The system will also identify the most strongly correlated model to help make sense of the data patterns being sensed against (804) previously sensed models stored in a Database (600), via a feedback loop (815). In one embodiment, the image sensor (804) will be dynamically adjusted to improve the sensing precision, for example, separating a foreground object or subject from the background in terms of contrast. A reference object in either 2D or 3D can be loaded (805) to help constrain the CV and aid in recognition of objects in the scene. Using a reference object to constrain the CV helps the Sensing Module (603) ignore noise in the image including shadows and non-target subjects, as well as objects that might enter or exit the frame.

Other sensors can be used in parallel or serially to improve the context and quality of sensing (806). For example, collecting the geolocation positions transmitted from the wearable devices or smartphones of the subjects in an imaged space can help provide richer real-time sensing data to other parts of the system, such as the Composition Module (605). Throughout the processes, the entity, object and scene capture validation (807) continuously evaluates what in the scene is being captured and recognized, and to what level of confidence. This confidence level of recognition and tracking is enhanced as other devices and cameras are added to the network, because their inputs and sensory capabilities can be shared and reused and their various screens and interface options can be used to provide rich feedback and direction (FIG. 3).

The sensing process might start over or move on to a plurality of dynamically ordered computer vision algorithm components (809) and/or machine learning algorithm components (810). In various embodiments, those components can include, for example, blob detection algorithms, edge detection operators such as Canny, and edge histogram descriptors. The CV components are always in a feedback loop (808) provided by previously stored leading practice models in the Database (600) and machine learning processes (811). In an embodiment, image sensing lens distortion (i.e., smartphone camera data) can be error-corrected for barrel distortion, and the gyroscope and compass can be used to understand the context of subject positions in a 3D space relative to camera angles (812). The system can generate 3D models from the device or networked service, or obfuscated and/or encrypted point clouds (813). These point clouds or models are also maintained in a feedback loop (814) with pre-existing leading practice models in the Database (600).
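Purely for illustration, the sketch below exercises the kinds of layered CV components named here (Canny edges, blob detection, and barrel-distortion correction) using OpenCV on a synthetic frame; the camera matrix and distortion coefficients are placeholder values, not a calibration prescribed by this disclosure:

    import cv2
    import numpy as np

    # Synthetic frame standing in for a captured image: one bright subject on black.
    frame = np.zeros((480, 640, 3), dtype=np.uint8)
    cv2.circle(frame, (320, 240), 30, (255, 255, 255), -1)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # (809) Edge detection with the Canny operator.
    edges = cv2.Canny(gray, 100, 200)

    # (809) Blob detection to localize candidate objects/subjects.
    detector = cv2.SimpleBlobDetector_create()
    keypoints = detector.detect(255 - gray)  # default detector looks for dark blobs

    # (812) Barrel-distortion correction with placeholder intrinsics; on a real
    # device these would come from calibration of the smartphone camera.
    camera_matrix = np.array([[600.0, 0.0, 320.0],
                              [0.0, 600.0, 240.0],
                              [0.0, 0.0, 1.0]])
    dist_coeffs = np.array([-0.25, 0.08, 0.0, 0.0, 0.0])  # k1, k2, p1, p2, k3
    undistorted = cv2.undistort(frame, camera_matrix, dist_coeffs)

    print(len(keypoints), int(edges.sum() > 0), undistorted.shape)

In the described system, the output of each such component would feed the feedback loop (808) rather than terminate the flow.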

A broader set of analytics and machine learning can be run against all models and data (604). The Sensing Module (603) is described earlier in FIG. 6, and a more detailed process flow is outlined here in FIG. 8. As powerful hardware is commercialized and further capabilities are unlocked via APIs, the system can correlate and analyze the increased sensor information to augment the Sensing Module (603) and provide greater precision and measurement of a scene.

FIG. 9 is a diagram of the architecture for the system (950) according to one or more embodiments. In an on-premise embodiment, the Processing Unit (900), which comprises the Sensing Module (603), Analytics Module (604), Composition/Architecture Module (605), and Direction/Control Module (606), can be connected to a processor (901) and a device-local database (600), or created in any other computer medium and connected to through a network (902), including being routed by a software-defined network (i.e., SDN) (911). The Processing Unit (900) can also be connected to an off-premise service for greater scale, performance and context by network SDN (912). This processing capability service cloud or data center might be connected by SDN (913) to a distributed file system (910) (e.g., HDFS with Spark or Hadoop), a plurality of service-side databases (600), or a cloud computing platform (909). In one or more embodiments, the Processing Unit can be coupled to a processor inside a host data processing system (903) (e.g., a remote server or local server) through a wired interface and/or a wireless network interface. In another embodiment, processing can be done on distributed devices for use cases requiring real-time performance (e.g., CV for capturing a subject's position) and that processing can be correlated with other processing throughout the service (e.g., other subjects' positioning in the scene).

FIG. 10 shows examples of iconic posing and professional compositions, including both stereotypical model poses (1000) and famous celebrity poses such as Marilyn Monroe (1001). These existing compositions can be provided to the system by the user and can be subsequently understood by the system, such that subjects can then be auto-directed to pose relative to a scene in a way that optimally reproduces these compositions, with feedback given in real-time as the system determines all precise spatial orientation and compliance with the model.

In one embodiment, a solo subject can also be directed to pose in the style of professional models (1002), incorporating architectural features such as walls and with special attention given to precise hand, arm, and leg placement and positioning, even when no specific image is providing sole compositional guidance or reference. To achieve this, the system can synthesize multiple desirable compositions from a database (600) into one composite reference composition model. The system also provides the ability to ingest existing 2D art (1006), which is then transformed into a 3D model used to auto-direct composition and can act as a proxy for the types of scene attributes a user might be able to recognize but not articulate or manually program.

In another embodiment, groups of subjects can be automatically directed to pose and be positioned so that hierarchy and status are conveyed (1010). This can be achieved using the same image synthesis method and system as in (1002), and by directing each subject individually while posing them relative to each other to ensure compliance with the reference model. The system's simultaneous direction of multiple subjects in frame can dramatically shorten the time required to achieve a quality composition. Whereas previously a family (1005) would have used time-delay and extensive back-and-forth positioning or enlisted a professional human photographer, now the system is able to direct them and reliably execute the ideal photograph at the right time, using ubiquitous hardware they already own (e.g., smartphones). The system is able to make use of facial recognition (1007) to deliver specific direction to each participant, in this embodiment achieving optimal positioning of the child's arm (1008, 1009). In another embodiment, the system is able to direct a kiss (1003) using the Sensing Module (603), Analytics Module (604), Composition/Architecture Module (605), and Direction/Control Module (606) and the method described in FIG. 7 to ensure both participants are in compliance with the choreography model throughout the activity. The system is also able to make use of sensed behaviors as triggers for other events, so that in one embodiment a dancer's movements can be used as inputs to direct the composition of live music, or in another embodiment specific choreography can be used to control the lighting of an event. This allows experts or professionals to create models to be emulated by others (e.g., for instruction or entertainment).
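One way such individualized direction could be generated is sketched below; the thresholds, phrasing, and the assumption of a shared ground-plane coordinate frame (which determines left versus right) are illustrative only:

    import math

    def direction_for_subject(sensed_xy, heading_deg, target_xy, target_heading_deg,
                              pos_tol_m=0.10, heading_tol_deg=5.0):
        # Produces one short instruction per call; the Direction/Control Module could
        # deliver it as speech, on-screen text, or a control message to equipment.
        dx, dy = target_xy[0] - sensed_xy[0], target_xy[1] - sensed_xy[1]
        dist = math.hypot(dx, dy)
        if dist > pos_tol_m:
            rel = (math.degrees(math.atan2(dy, dx)) - heading_deg + 180) % 360 - 180
            if abs(rel) <= 45:
                move = "forward"
            elif abs(rel) >= 135:
                move = "backward"
            else:
                move = "to your left" if rel > 0 else "to your right"
            return f"take {dist:.1f} m {move}"
        turn = (target_heading_deg - heading_deg + 180) % 360 - 180
        if abs(turn) > heading_tol_deg:
            return f"rotate {abs(turn):.0f} degrees to the {'left' if turn > 0 else 'right'}"
        return "hold position"

    # Example: a subject at the origin facing +x, directed toward a composed spot.
    print(direction_for_subject((0.0, 0.0), 0.0, (1.2, 0.3), 90.0))

Calling this once per sensed subject, with that subject's own target from the reference model, yields the simultaneous yet individual direction described above.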

FIG. 11 is provided as an example of a consumer-facing UI for the system that would allow for assignment of models to scenes (1103), and of roles to subjects (1100) and objects. Virtual subject identifiers can be superimposed over a visual representation of the scene (1101) to provide auto-linkage from group to composition and to allow for intuitive dragging and reassignments (1105). Sensed subjects, once assigned, can be linked to complex profile information (1104) including LinkedIn, Facebook, or various proprietary corporate LDAP or organizational hierarchy information. Once identified, subjects can be directed simultaneously and individually by the system, through the interfaces described in FIG. 3.

In scenarios where distinguishing between subjects is difficult (poor light, similar clothing, camouflage in nature), stickers or other markers can be attached to the real-world subjects and tagged in this manner. Imagine a distinguishing sticker placed on each of the five subjects (901), helping to keep them correctly identified. These stickers or markers can be any sufficiently differentiated pattern (including stripes, dots, solid colors, text) and can be of any material, including simple paper and adhesive, allowing them to come packaged in the magazine insert from FIG. 1 (105) as well.

FIG. 12 provides further examples of compositions that are difficult to achieve traditionally, in this case because of objects or entities relative to the landscape of the scene. Nature photography in particular poses a challenge due to the uncontrollable lighting on natural features such as mountains in the background versus the subject in the foreground (1200). Using the interface described in FIG. 11, users are able to create rules or conditions to govern the capture process and achieve the ideal composition with minimal waste and excess. Those rules can be used to suggest alternate compositions or directions if the desired outcome is determined to be unattainable, for example because of weather. Additionally, existing photographs (1201) can be captured by the system as a method of creating a reference model. In one embodiment, the auto-sensing capabilities described in FIG. 8 combined with compositional analysis and geolocation data can deliver specific user-defined outcomes such as a self-portrait facing away from the camera, executed when no one else is in the frame and the clouds are over the trees (1202). In another embodiment, the composition model is able to direct a subject to stand in front of a less visually “busy” section of the building (1203).

Much of the specific location information the system makes use of to inform the composition and direction decisions is embodied in a location model, as described in FIG. 13. Representing specific geolocations (1305), each pin (1306) provides composition and direction for camera settings and controls, positioning, camera angles (1302), architectural features, lighting, and traffic in the scene. This information is synthesized and can be presented to the user in such a way that the compositional process is easy to understand and highly automated, while delivering high-quality capture of a scene. For example, consider a typical tourist destination involving the Arc de Triomphe that can be ideally composed (1307). The system is able to synthesize a wide range of information (including lighting and shadows depending on date/time, weather, expected crowd sizes, and ratings of comparable iterations of this photo taken previously) which it uses to suggest desirable compositions and execute them with precision and reliability, resulting in a pleasant and stress-free experience for the user.
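For illustration, a location pin might be represented roughly as follows; the field names and the selection heuristic are hypothetical, not a schema defined by this disclosure:

    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class LocationPin:
        # One geolocated composition/direction record of the kind FIG. 13 describes.
        latitude: float
        longitude: float
        camera_bearing_deg: float          # which way to face the camera
        suggested_aperture: str            # e.g. "f/8"
        best_hours: List[Tuple[int, int]]  # local-time windows with good light
        crowd_rating: float = 0.0          # expected foot traffic, 0 (empty) to 1 (packed)
        notes: str = ""

    def best_pin_for_time(pins, hour_local):
        # Pick a pin whose lighting window contains the current hour, preferring lower crowds.
        candidates = [p for p in pins
                      if any(start <= hour_local < end for start, end in p.best_hours)]
        return min(candidates, key=lambda p: p.crowd_rating) if candidates else None

    # Example: two pins at the same landmark, one suited to morning, one to evening.
    pins = [LocationPin(48.8738, 2.2950, 110.0, "f/8", [(7, 10)], 0.2, "morning light"),
            LocationPin(48.8738, 2.2950, 290.0, "f/5.6", [(18, 21)], 0.6, "sunset angle")]
    print(best_pin_for_time(pins, 8))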

FIG. 14 is a representation of computer vision and simple models informing composition. A building's exterior (1401) invokes a perspective model (1402) automatically to influence composition through a CV process and analytics of images during framing of architectural photography. The lines in the model (1402) can communicate ideal perspective to direct perspective, depth, and other compositional qualities, to produce emotional effects in architectural photography applications such as real estate listings.

Referring now to FIG. 15A, the system can make use of a smartphone or camera-equipped aerial drone (1500) to perform surveillance and visual inspection of traditionally difficult or dangerous-to-inspect structures such as bridges. Using 3D-constrained CV to navigate and control the drone autonomously and more precisely than traditional GPS waypoints, the system can make use of appropriate reference models for weld inspections, corrosion checks, and insurance estimates and damage appraisals. Relative navigation based on models of real-world structures (1502) provides greater flexibility and accuracy when directing equipment when compared to existing methods such as GPS. Because the system can make use of the Sensing Module (603), it is able to interpret nested and hierarchical instructions such as “search for corrosion on the underside of each support beam.” FIG. 15B depicts an active construction site, where a drone can provide instant inspections that are precise and available 24/7. A human inspector can monitor the video and sensory feed or take control from the system if desired, or the system is able to autonomously control the drone, recognizing and scoring the construction site's sensed environment for compliance based on existing models (e.g., local building codes). Other BIM (Building Information Management) applications include persistent monitoring and reporting as well as responsive structures that react to sensed changes in their environment, for example a window washing system that uses constrained CV to monitor only the exposed panes of glass in a building and can intelligently sense the need for cleaning in a specific location and coordinate an appropriate equipment response, autonomously and without human intervention.

Human subjects (1600) can be deconstructed similarly to buildings, as seen in FIG. 16. Beginning with a close and precise measurement of the subject's body (1601), which can be abstracted into, for example, a point cloud (1602), composite core rigging (1603) can then be applied such that a new composite reference core or base NURB 3D model is created (1604). This deconstruction, atomization, and reconstruction of subjects allows for precision modeling and the fusing of real and virtual worlds.

In one embodiment, such as a body measurement application for Body Mass Index or another health use-case, fitness application, garment fit or virtual fitting application, a simpler representation (1605) might be created and stored at the device for the user interface or in a social site's datacenters. This obfuscates the subject's body, masking their vivid body model to protect their privacy and address any social “body image” concerns. Furthermore, data encryption and hash processing of these images can also be automatically applied in the application on the user's device and throughout the service to protect user privacy and security.

Depending on the output from the Sensing Module (603), the system can either create a new composition model for the Database (600), or select a composition model based on attributes deemed most appropriate for composition: body type, size, shape, height, arm position, face position. Further precise composition body models can be created for precise direction applications in photo, film, theater, musical performance, dance, and yoga.

FIG. 17 catalogues some of the models and/or data that can be stored centrally in a database (600) available to all methods and processes throughout the system, to facilitate a universal scoring approach for all items. In one example, models for best practices for shooting a close-up movie scene (1702) are stored and include such items as camera angles, out-of-focus effects, aperture and exposure settings, depth of field, lighting equipment types with positions and settings, dolly and boom positions relative to the subject (i.e., actor), where “extras” should stand in the scene and their entrance timing, and set composition. By sensing and understanding the subjects and contexts of a scene over time via those models, film equipment can be directed to react in context with the entire space. An example is a networked system of camera dollies, mic booms, and lighting equipment on a film set that identifies actors in a scene and automatically cooperates with other networked equipment to provide optimal composition dynamically and in real-time, freeing the human director to focus on storytelling.

The Database (600) can also hold 2D images of individuals and contextualized body theory models (1707), 3D models of individuals (1705), and 2D and 3D models of clothing (1704), allowing the system to score and correlate between models. In one embodiment, the system can select an appropriate suit for someone it senses is tall and thin (1705) by considering the body theory and fashion models (1707) as well as clothing attributes (1704) such as the intended fit profile or the number of buttons.

The system can keep these models and their individual components correlated to social feedback (1703) such as Facebook, YouTube, Instagram, or Twitter, using metrics such as views, likes, or changes in followers and subscribers. By connecting the system to a number of social applications, a number of use cases could directly provide context and social proof around identified individuals in a play or movie, from the overall composition and cinematography of a scene in a play, music recital, movie or sports event to how well received a personal image (501) or group image or video was (1101). This also continuously provides a method and process for tuning best practice models of all types of compositions, from photography, painting, and movies to skiing, mountain biking, surfing, competitive sports, exercises, yoga poses (510), dance, music, and performances.

All of these composition models can also be analyzed for trends in social popularity, from fashion to popular dance moves and the latest form alterations to yoga or fitness exercises. In one example use case, a camera (202) and a broad spectrum of hardware (1706), such as lights, robotic camera booms or dollies, and autonomous quadcopters, could be evaluated individually, or as part of the overall composition including such items as lights, dolly movements, and the camera with its multitude of settings and attributes.

Referring now to FIG. 18, in one embodiment the system can facilitate learning an instrument through the provision of real-time feedback. 3D models of an instrument, for example a guitar fretboard model, can be synthesized and used to constrain the CV algorithms so that only the fingers and relevant sections of the instrument (e.g., frets for guitars, keys for pianos, heads for drums) are being analyzed. Using the subject assignment interface from FIG. 11, each finger can be assigned a marker so that specific feedback can be provided to the user (e.g., “place 2nd finger on the A string on the 2nd fret”) in a format that is understandable and useful to them. While there are many different ways to learn guitar, no other system looks at the proper hand (1802) and body (1800) position. Because the capture device (110) can be networked with other devices, instruction can be given holistically, and complex behaviors and patterns such as rhythm and pick/strum technique (1805) can be analyzed effectively. Models can be created to inform behaviors varying from proper bow technique for violin to proper posture when playing keyboard. In one embodiment, advanced composition models and challenge models can be loaded into the database, making the system useful not just for beginners but for anyone looking to improve their practice regimens. These models can be used as part of a curriculum to instruct, test and certify music students remotely. As with FIG. 15, a human expert can monitor the process and provide input, or the sensing, analyzing, composing and directing can be completely autonomous. In another embodiment, renditions and covers of existing songs can also be scored and compared against the original and other covers, providing a video-game-like experience but with fewer hardware requirements and greater freedom.
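A minimal sketch of how fingertip positions located within a fretboard region could be turned into corrective feedback follows; the grid mapping, finger labels, and the target fingering are illustrative assumptions, not a fingering model defined by this disclosure:

    def nearest_cell(x_norm, y_norm, n_frets=5, n_strings=6):
        # Map a fingertip position, normalized to the fretboard region located by CV
        # (0..1 across the first n_frets frets and n_strings strings), to (fret, string).
        fret = min(int(x_norm * n_frets) + 1, n_frets)
        string = min(int(y_norm * n_strings) + 1, n_strings)
        return fret, string

    def chord_feedback(fingertips, target):
        # Compare detected fingertip placements against a target fingering and return
        # spoken-style corrections of the kind described above.
        messages = []
        for finger, (x, y) in fingertips.items():
            placed = nearest_cell(x, y)
            wanted = target.get(finger)
            if wanted and placed != wanted:
                messages.append(f"move {finger} finger to string {wanted[1]}, fret {wanted[0]}")
        return messages or ["position looks correct"]

    # Target here is an illustrative A-major-style shape: three fingers on the 2nd fret.
    a_major = {"1st": (2, 4), "2nd": (2, 3), "3rd": (2, 2)}
    print(chord_feedback({"1st": (0.25, 0.55), "2nd": (0.45, 0.40), "3rd": (0.30, 0.30)}, a_major))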

FIG. 19 shows an example of a golf swing (1901) to illustrate the potential of a database of models. Once the swing has been scanned with a pre-modeled club or putter, that model is available for immediate application and is stored in a Database (600). A plurality of sensed movements can be synthesized into one, so that leading practice golf swings are sufficiently documented. Once stored, the models can be converted to compositional models, so that analysis and comparison can take place between the sensed movements and the stored compositional swing, and direction and feedback can be given to the user (1902, 1903).
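As one sketch of such a comparison (the joint-angle series, tolerance, and resampling length are assumed values for illustration, not a method prescribed by this disclosure), a sensed swing can be resampled and scored against a stored reference:

    import numpy as np

    def resample(series, n=50):
        # Resample a variable-length sequence of joint angles to n points so that
        # swings of different durations can be compared sample-for-sample.
        series = np.asarray(series, dtype=float)
        old_x = np.linspace(0.0, 1.0, len(series))
        return np.interp(np.linspace(0.0, 1.0, n), old_x, series)

    def swing_deviation(sensed_angles, reference_angles, tolerance_deg=8.0):
        # Per-sample deviation between a sensed swing and a stored reference swing
        # (e.g., lead-arm angle over time); returns the mean error and where it
        # exceeds tolerance, which could drive the feedback in (1902, 1903).
        s, r = resample(sensed_angles), resample(reference_angles)
        err = np.abs(s - r)
        return float(err.mean()), np.flatnonzero(err > tolerance_deg)

    # Illustrative values only: the sensed swing drifts from the reference mid-sequence.
    reference = np.linspace(0, 180, 60)
    sensed = np.linspace(0, 180, 55) + np.r_[np.zeros(30), np.full(25, 12.0)]
    mean_err, off_samples = swing_deviation(sensed, reference)
    print(round(mean_err, 1), off_samples[:5])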

FIG. 20 is a systematic view of an integrated system for Composition and Orchestration of a 3D or 2.5D space, illustrating communication between users and their devices and a server through a network (902) or SDN (911, 912, 913), according to one embodiment. In one embodiment, a user or multiple users can connect to the Processing Unit (900) that hosts the composition event. In another embodiment, user hardware such as a sensor (2001), TV (2003), camera (2004), mobile device such as a tablet or smartphone (2005), wearable (2006), server (2007), laptop (2008) or desktop computer (2009), or any wireless or other electronic device can communicate directly with other devices in a network or with the devices of specific users (2002, 902). For example, in one embodiment the orchestration system might privately send unique positions and directions to four separate devices (e.g., watch, smartphone, quadcopter (1706), and an internet-connected TV) while quickly composing high-quality and repeatable photographs of actors and fans at a meet-and-greet event.

What is claimed is:
 1. A method, comprising: Capturing a 2D image in a specific format of an object, subject, and scene using a device; Sensing an object, subject, and scene automatically and continuously using the device; Analyzing the 2D image of the object, subject, and scene captured to determine the most relevant composition and direction model; Transforming an object, subject, and scene into a 3D model using an existing reference composition/architecture model; and Storing the 3D model of the scene in a database for use and maintaining it in a feedback loop.
 2. The method of claim 1, further comprising: Performing continuous contextual analysis of an image and its resulting 3D model to provide an update to subsequent 3D modeling processes; and Dynamically updating and responding to contextual analytics performed.
 3. The method of claim 2, further comprising: Coordinating accurate tracking of objects and subjects in a scene by orchestrating autonomous equipment movements using a feedback loop.
 4. The method of claim 3, further comprising: Controlling the direction of a scene and its subjects via devices using a feedback loop.
 5. The method of claim 4, further comprising: Creating and dynamically modifying in real-time the 2D or 3D model for the subject, object, scene, and equipment in any spatial orientation; and Providing immediate feedback in a user interface.
 6. The method of claim 1, wherein the device is at least one of a camera, wearable device, desktop computer, laptop computer, phone, tablet, and other mobile computer.
 7. A system, comprising: A processing unit that can exist on a user device, on-premise, or as an off-premise service to house the following modules; A sensing module that can understand the subjects and context of a scene over time via models; An analytics module that can analyze sensed scenes and subjects to determine the most relevant composition and direction models or create them if necessary; A composition/architecture module that can simultaneously store the direction of multiple subjects or objects of a scene according to one or more composition models; A direction/control module that can provide direction and control to each subject, object, and equipment individually and relative to a scene model; and A database that can store models for use and maintain them in a feedback loop with the above modules.